<html>
<head>
	<title>Galago Guidebook</title>
	<link rel="stylesheet" type="text/css" href="http://ciir.cs.umass.edu/~strohman/style.css" />
</head>    
<body>
<div id="shrink">

<h4>Warning</h4>

<p>Some of this text is out of date and refers to an older version of Galago.
I've added it to the new website for reference.</p>

<h4>What is Galago?</h4>

<p>Galago is a search engine toolkit primarily research and educational use.  It is released
under a BSD license, so it may be incorporated freely into commercial work, but most
commercial users may want something with less of an experimental focus.</p>
                       
<p>Galago differs from the other open source search packages in primarily its customizability and
scalability.  Galago allows most of the indexing and retrieval process to be changed, 
often without changing any code.  The indexing process is based on TupleFlow, a MapReduce
variant, which can efficiently distribute execution across many processors or machines.
TupleFlow can be used separately from the indexing and retrieval parts of Galago.</p>

<p>Some of the indexes built in Galago can be processed by C++ code for faster retrieval performance.</p>     

<h4>Retrieval Customization</h4>
    
<p>There are two general ways to customize the retrieval process: generating 
custom indexes, and using the query language.</p>

<p>The custom index approach lets you use the flexible indexing tools in Galago to build
specialized indexes that have your own ranking function built-in.  Once you have built one
of these highly customized indexes (usually called binned indexes), you can't change 
the query processing strategy.  However, these binned indexes lead to very fast retrieval
performance.</p>

<p>The slower, but more flexible approach is to use the query language.  Galago includes a 
query language which is a descendent of the Inquery and Indri query languages.  It's 
a little bit like SQL for text.  It allows you to describe how you want documents to be
ranked in a very flexible way.  Since flexibility has a cost, this approach is quite a bit 
slower than the custom index approach.  However, you may not notice the difference unless you
have high query loads or over a million documents to search.</p>

<h4>TupleFlow</h4> 

<p>The indexing component of Galago is based on a framework called TupleFlow.  It is probably best
to think of TupleFlow as a mix between MapReduce, make/ant, and a database system.  TupleFlow
is like MapReduce in that it can efficiently parallelize a large computation.  It is like
make or ant in that it runs based on a file that looks much like a Makefile.  And, it is like
a database system because the glue that holds the computation together are lists of
tuples, just like database tables.</p>                                        

<p>If you are familiar with Makefiles, you know how a Makefile is a series of targets.  Each
target contains a series of rules that builds a particular target.  The targets are linked
together by dependence.  The job of the make program (or ant) is to process the rules and
dependence information to build some software.  For example, to build a Java jar file, you must
first compile Java source code files into Java class files, then combine those class files together
into a jar file.  Therefore, in a build file the final target would be a jar creation stage, but
that target would depend on compilation steps.  With a well designed Makefile, you simply
ask for a jar file, and the build system realizes that code must be compiled first to give you
what you want.</p>

<p>TupleFlow works in much the same way.  Instead of targets, we have stages.  A stage is,
broadly, a series of steps that convert data of some type into data of another type.  Unlike
MapReduce, TupleFlow allows a single stage to have many kinds of inputs and many kinds of outputs.
Also unlike most MapReduce-like systems, which tend to be code-based, TupleFlow expresses
all of the important logic in an XML parameter file.  An example of a TupleFlow stage
description is shown below:</p> 

            <pre>
&lt;stage id="parsing"&gt;
    &lt;dependencies&gt;
        &lt;input id="textfiles" /&gt;
        &lt;output id="counts" /&gt;
        &lt;output id="documents" /&gt;
        &lt;output id="postings" /&gt;     
    &lt;/dependencies&gt;              
	&lt;connections&gt;
		&lt;input class="galago.types.FileName"
		       id="filenames" 
		       order="+filename"/&gt;    
		&lt;output class="galago.types.WordCount"
		        order="+word"
		        id="counts" /&gt;
		&lt;output class="galago.types.DocumentIdentifier"
		        order="+identifier"
		        id="documents" /&gt;
		&lt;output class="galago.types.DocumentWordPosition"
		        order="+document +word +position"
		        id="postings" /&gt; 
		&lt;output class="galago.types.DocumentIdentifierExtent"
			    order="+document"
				id="extents"/&gt;
	&lt;/connections&gt;              
	                   
	&lt;steps&gt;
    	&lt;input id="filenames" /&gt;
        &lt;step class="galago.parse.UniversalParser"
        	   data="textfiles" /&gt;
        &lt;step class="galago.parse.TagTokenizer" /&gt;
        &lt;step class="galago.parse.Stopper"&gt;
              &lt;word&gt;the&lt;/word&gt;
              &lt;word&gt;and&lt;/word&gt;
              &lt;word&gt;or&lt;/word&gt;
              &lt;word&gt;of&lt;/word&gt;
              &lt;word&gt;to&lt;/word&gt;
              &lt;word&gt;a&lt;/word&gt;
        &lt;/step&gt;
        &lt;step class="galago.parse.Porter2Stemmer" /&gt;
        &lt;multi&gt;                    
            &lt;!-- body --&gt;
            &lt;group&gt;
            	&lt;step class="galago.parse.FieldPostingsCounter" /&gt; 
            	&lt;split id="postings" /&gt;
            &lt;/group&gt;

            &lt;!-- language model counts --&gt;
            &lt;group&gt;
            	&lt;step class="galago.parse.WordCounter" /&gt;
            	&lt;step class="galago.tupleflow.Sorter"&gt;
					  &lt;order&gt;+word&lt;/order&gt;
            		  &lt;class&gt;galago.types.WordCount&lt;/class&gt;
				&lt;/step&gt;
            	&lt;write id="counts" /&gt;
            &lt;/group&gt;      

            &lt;!-- identifier extractor --&gt;
            &lt;group&gt;
            	&lt;step class="galago.parse.DocumentIdentifierExtractor" /&gt;
            	&lt;step class="galago.tupleflow.Sorter">
					  &lt;order&gt;+identifier&lt;/order&gt;
            		  &lt;class&gt;galago.types.DocumentIdentifierType&lt;/class&gt;
				&lt;/step&gt;
            	&lt;write id="documents" /&gt;
            &lt;/group&gt;  
        &lt;/multi&gt; 
    &lt;/steps&gt;     
&lt;/stage&gt;  
        	</pre>  

<p>This stage is a parsing stage.  Let's look at how this stage works by starting at the
&lt;steps&gt; tag.  The best way to think of this stage description is to think of data 
flowing through each step, starting at from the input tag and flowing to the output tags. 
From the input step, a list of filenames flows to the UniversalParser.  The parser opens
each file in turn and extracts a stream of documents.  Those documents flow through the 
tokenizer, then the stopper, and then the stemmer.  At this point, the data hits a 
&lt;multi&gt; tag.  A copy of each processed document is sent to each of the three 
&lt;group&gt; tags.  The first one stores word position information that will be stored in
an inverted index.  The second group counts the words in the document so these statistics
can be used later for retrieval purposes.  Finally, the last group just extracts the 
name of the document and stores it in a document names list.</p>   

<p>Most of these tags are step tags.  A step tag loads a Java class and connects it
to a data pipeline.  A Java class needs to implement the Step interface for this to work.
The Processor interface shows how the data passes from one object to another: each
object calls the <tt>process</tt> method of the next object.  If you want, you can create
your own Step classes and add them into your index process.  Galago already comes with
many of the Step classes you'll need.</p>

<p>Before we go further, notice that there is a special step class called Sorter.
Sorter reads in all of its incoming tuples and then sorts them based on the order
specified.  Sorter uses temporary files to do this sorting if there is too much
data to fit in memory.  Soring is a powerful tool for this kind
of text/data processing.</p>  

<p>A sample job file is provided with the toolkit as an example.  The job file shows how 
stages are connected together by connections.  These connections correspond to the 
dependence information in a Makefile.  The connections tell TupleFlow which order to run
the stages, and whether stages can be distributed.  By specifying the workflow in this way,
TupleFlow can automatically distribute work between multiple processors on one machine, 
or many machines on a network (with some help from a job scheduler like Grid Engine or Condor).</p>
                                                 
<p>Notice in the sample that you can define property variables that can be used later in the job file.
This can be useful for repeatedly specifying pathnames.</p>

<h4>Applications</h4>

<ul>
	<li>StructuredRetrieval:  Runs structured queries.</li>
	<li>DumpIndex:  Prints out the contents of binary files.</li>
	<li>JobExecutionContext:  Runs TupleFlow jobs.</li>
</ul>                                             

<h4>Query languages</h4>
                   
<p>There are three types of queries that Galago supports for StructuredRetrieval.
The first is <em>simple</em> mode.  Simple mode queries are parsed by SimpleQuery.
The simple query parser is very permissive of excess punctuation, and allows
the use of quotes to indicate phrases.  An example might look like this:</p>

<tt>"white house" "press conference"</tt>

<p>The second mode is <em>explicit</em> mode.  Explicit mode queries require
almost everything about query processing to be specified within the query.
An example:</p>

<pre>
#combine( #feature:dirichlet:mu=1500( #ordered:width=1( #positions:white()
                                                        #positions:house() ) )
          #feature:dirichlet:mu=1500( #ordered:width=1( #positions:press()
                                                        #positions:conference() ) ) )</pre>
     
<p>The final mode is <em>structured</em> mode.  The structured mode allows the use
of query operators, but without the wordiness of the explicit mode.  An example:</p>

<tt>#combine( #1(white house) #1(press conference) )</tt>
       

<h4>Structured reference</h4>
               
The structured retrieval model works just as it does in Indri, except all the
passage retrieval operators are gone.

<h4>Explicit reference</h4>          
                       
<p>In the explicit query mode, each operator corresponds directly to an iterator used
in the retrieval process.  Each operator takes some number of iterators as arguments,
and returns an iterator itself.  There are three basic iterator types:</p>

<ul>
	<li><tt>ScoreIterator</tt>, which acts like a list of document scores</li>
	<li><tt>ExtentIterator</tt>, which is a document-sorted list of extents (begin and end word positions)</li> 
	<li><tt>CountIterator</tt>, which is a document-sorted list of term counts</li>
</ul>

<p>You can add your own iterator types if you want, but all queries must return a <tt>ScoreIterator</tt>
object.</p>

<h5>List Operators</h5>
                   
The basic query operators are the list operators.  These operators retrieve inverted lists
from the index.  Since these retrieve indexes, they have no iterator arguments.  The operators
are:

<ul>
	<li><tt>#scores</tt></li>
	<li><tt>#extents</tt></li>
	<li><tt>#positions</tt></li>
	<li><tt>#counts</tt></li>
</ul>

<p>The <tt>#scores</tt> operator retrieves a list of document scores from the score index.
A good example of this might be a static document prior like PageRank.  You would refer to 
PageRank like this: <tt>#scores:pagerank()</tt>. </p>

<p>The <tt>#extents</tt> operator is used for fields, like titles or headings.  The operator
<tt>#extents:title()</tt> retrieves a list of title field positions for use in the <tt>#inside</tt>
operator, or all by itself.</p>
   
<p>The <tt>#positions</tt> operator retrieves word position information.  This is useful
when using a proximity operator, like this: <tt>#ordered( #positions:white() #positions:house() )</tt>.  You can use the <tt>#counts</tt> operator for words as well, but the <tt>#counts</tt>
operator returns only counts.  Therefore, it can only be used with operators that only need
term count information, like <tt>#feature:dirichlet</tt>.</p>

<h5>Term Weighting</h5>
                  
<p>To turn term or extent counts into scores (which may be probabilities), we use
term weighting functions.  The standard one is <tt>dirichlet</tt>, which performs
well in practice.  The <tt>dirichlet</tt> method accepts <tt>mu</tt> as a smoothing parameter.
If none is given, the default value of 1500 is used.  The <tt>dirichlet</tt> method also
relies on a <tt>collectionProbability</tt> value which is automatically computed from
the index, but can be overridden within the query.</p>
                                                            
<p>Examples:</p>

<ul>
	<li><tt>#feature:dirichlet( #counts:dog() )</tt></li>
	<li><tt>#feature:dirichlet:mu=2000( #counts:dog() )</tt></li>
	<li><tt>#feature:dirichlet:mu=2000:collectionProbability=0.0000123( #counts:dog() )</tt></li>
</ul>   

Right now only <tt>dirichlet</tt> is supported, but <tt>bm25</tt> and <tt>linear</tt> are
coming soon.

<h5>Standard Operators</h5>
      
<ul>
	<li>#combine</li>
	<li>#scale</li>
	<li>#ordered</li>
	<li>#unordered</li>
	<li>#synonym</li>
	<li>#inside</li>  
	<li>#feature</li>
</ul>          
      
<p><tt>#combine</tt>: Takes many ScoreIterators as arguments.  Adds the score from
    each iterator to form a final document score.  (Note that the probabilistic operators
	are in log space, so this addition is like multiplying probabilities).</p>
	
<p><tt>#scale</tt>: Takes a single ScoreIterator argument, and multiplies its score by
	some constant factor.  The <tt>#scale</tt> and <tt>#combine</tt> operators can be 
	used together to get the same effect as the Indri <tt>#weight</tt> operator.
	Example: <tt>#scale:weight=0.5( #scores:pagerank() )</tt></p>
	
<p><tt>#ordered</tt>:  Like the Inquery/Indri ordered window operator.  Searches for 
	occurrences of its arguments appearing in text less than <tt>width</tt>-1 words apart,
	and order matters.
	<tt>#ordered:width=1( #positions:white() #positions:house() )</tt> searches for
	the phrase "white house".</p>                                     
	                                         
<p><tt>#unordered</tt>:  Like <tt>#ordered</tt>, except the order of the words doesn't matter,
	and the width argument is interpreted differently.  For <tt>#unordered</tt>, the width
	is the entire width of the match (length from the beginning of the first extent to the 
	end of the last extent).</p>

<p><tt>#synonym</tt>:  Returns the union of its ExtentIterator arguments.</p>

<p><tt>#inside</tt>:  Takes two ExtentIterator arguments.  Returns for instances where
	the first extent appears inside the second extent.  Example:  <tt>#inside( #positions:dog() 
	#extents:title() )</tt> searches for the word <tt>dog</tt> appearing in the title of a document.</p>
	
<p><tt>#feature</tt>:  The extensible operator.  This is the operator that can load other classes
	and iterators and add them to the retrieval network.</p>

<h5>Custom Operators</h5>
    
<p>The <tt>#feature</tt> operator can be used to extend the query language with your
own custom operators.  The syntax to use is <tt>#feature:class=MyClass:arg1=val1( #counts:dog() )</tt>.
This operator would load the Java class <tt>MyClass</tt>, passing <tt>arg1=val1</tt> to it as a
parameter, and <tt>#counts:dog()</tt> as a child iterator.  The only restriction is that
<tt>MyClass</tt> must implement the <tt>DocumentDataIterator</tt> interface (which currently
contains no methods). </p>

<p>Example:</p>
   
<tt>#feature:class=edu.umass.MyProximityOperator:width=4( #positions:dog() #positions:cat() )</tt>

<p>This example loads a class called <tt>edu.umass.MyProximityOperator</tt>.  Its constructor
is called with two arguments.  The first argument is a <tt>Parameters</tt> object with 
the key "width" set to "4".  The second argument is an ArrayList containing two ExtentIterators,
one for <tt>dog</tt> and one for <tt>cat</tt>.  If a two argument constructor doesn't exist,
Galago will throw an exception.  Galago also requires that MyProximityOperator implement the
<tt>DocumentDataIterator</tt> interface.</p>

<p>Remember that your operator class must be in your classpath.  This does not mean that 
your new operator needs to be in the Galago JAR file.  You can distribute your custom operators
in your own JAR.  This makes it easy to mix add-on operators from many different people; just
use many different JAR files to store the operators.</p>

<h4>Making a custom index type</h4>
                
<ul>
	<li>Use IndexWriter</li>
	<li>Implement InvertedList in your own subclass</li>
	<li>Use BackedCompressedByteBuffer instances, since they can write temporary data to disk if necessary.</li>
	<li>Use IndexReader to read your completed index.</li>
</ul>

<h4>Making a custom query operator</h4>
                            
<p>You can extend the query language by building your own DocumentDataIterator.  You need
to create a class that implements DocumentDataIterator, and preferably either ExtentIterator,
CountIterator, or ScoreIterator.  A ScoreIterator returns score/probability/belief values.
An extent iterator returns word or phrase extents (begin and end positions in a document).
A CountIterator returns the number of times a paricular event (like a word occurrence)
appears in a document.</p>

<p>Your iterator class must have a constructor that takes a Parameters object as the first
argument, and an ArrayList of DocumentDataIterators as the second argument.  Look at 
OrderedWindowIterator as an example.  The parameters passed between the colons right after
the operator name are passed through the Parameters object, and the iterators between
the parentheses are passed through the childIterators list.  Your job is to use the data
in those child iterators to make a new kind of data.</p>

<p>Once you've made your class, make sure that it is in your classpath, and reference it like this:
<tt>#feature:class=galago.mypackage.MyOperator( #counts:dog() #counts:cat() )</tt>
</p>

<h4>Making a TupleFlow type</h4>
          
<p>The whole TupleFlow concept is based on data items, called tuples, flowing between 
steps.  Any kind of Java object can be used to represent a tuple, as long as it is
never sorted, hashed, or transmitted between stages.  However, any tuple that is sorted, 
hashed, or transmitted needs to be a subclass of galago.tupleflow.Type.</p>    

<p>The galago.tupleflow.Type interface forces the tuple class to provide Comparator
objects for use in sorting, hash functions for hashing, and readers and writers for
file serialization.  While occasionally you might want to write all this code yourself,
it is probably best to let Galago do it for you.  The TemplateTypeBuilder class/program
can help you do this.</p>

<p>Suppose you want a type that represents word counts.  Clearly this tuple needs two
	elements: a string that holds a word, and an integer that holds a count.  You can do it
	like this:</p>
	
<pre>
java galago.tupleflow.TemplateTypeBuilder \
     WordCount.java WordCount org.mytypes \
     String:word int:count
</pre>                                                                        

<p>This command makes a Java class called WordCount in the file WordCount.java within the 
package ord.mytypes.  The class contains a member variable called <tt>word</tt> which is a
String, and <tt>count</tt> which is an integer.</p>

<p>However, that command didn't help us with ordering.  The TemplateTypeBuilder program 
requires that you specify what orders you want this type to support.  Here are some
example orders:</p>

<ul>
	<li><tt>+word</tt> : Sort by word in ascending order</li>
	<li><tt>-word</tt> : Sort by word in descending order</li>
	<li><tt>+word -count</tt> : Sort by word in ascending order.  Break ties by sorting by count in 
							descending order.</li>
</ul>

To allow all of these orderings, you could use this command:
                     
<pre>
java galago.tupleflow.TemplateTypeBuilder \
     WordCount.java WordCount org.mytypes \
     String:word int:count \
     order:+word order:-word order:+word-count
</pre>                                                                             

<p>Notice that there are no spaces in the order strings.  Typically Galago order strings
have spaces in them, but we take them out when using TemplateTypeBuilder.</p>
                     
<p>TemplateTypeBuilder supports a few different column types:</p>

<ul>
	<li>int</li>
	<li>long</li>
	<li>float</li>
	<li>double</li>
	<li>boolean</li>
	<li>String</li>
	<li>bytes (byte[])</li>
</ul>

<p>The last option, <tt>bytes</tt>, is a special case.  It represents a byte array that is
treated like a UTF-8 string.  The reason for using <tt>bytes</tt> instead of <tt>String</tt>
is that <tt>bytes</tt> types are ordered exactly the way that the C++ retrieval expects them.
All index writer classes need to take words in <tt>bytes</tt> mode instead of <tt>String</tt>,
otherwise your C++ retrieval application may crash intermittently.</p> 

<h4>Embedding Galago in another application</h4>

<h4>Using the web interface</h4>

<p>GalagoWeb allows you to quickly put Galago indexes on the web.
GalagoWeb gives you a typical web-style search interface, as well as three REST-style web services that allow you to retrieve search results, highlighted snippets, and full document text.</p>

<p>The current web interface only supports BinnedIndexes, but that 
will change soon.  The required servlet parameters are:</p>

<ul>
	<li><tt>index</tt>: The name of the index you want to serve.</li>
	<li><tt>databaseUrl</tt>: The URL to the database where the document text is stored.</li>
	<li><tt>driverName</tt>: The Java class name of the JDBC driver necessary for connecting to the database.</li>
</ul>

<p>You will want to use the galago.store.DocumentStoreWriter class during index time, since it
can automatically load your indexed documents into your database.</p> 

<p>GalagoWeb needs to run in some kind of Java servlet container.  Apache Tomcat is a popular
option, but others will work too.</p>               

<h4>Running retrievals with Galago</h4>         

<p>Make a structured index, then use the StructuredRetrieval program (more information to follow eventually)</p>
                                                                                                           
<h4>Moving to C++</h4> 

<p>Galago includes a complete C++ retrieval system for binned indexes, and will soon
include an implementation for document-sorted score indexes.  Neither of these support
the explicit structured query language; this is primarily for short, uncomplicated queries
that need to be processed at high speed.</p>

<p>The C++ code that comes with Galago is capable of reading the kinds of indexes that
are built with the IndexWriter class.  Therefore it should be fairly simple to adapt the
existing C++ code to a new type of index that you create yourself.</p>

</div> 
</body>
</html>
